-
-
Notifications
You must be signed in to change notification settings - Fork 73
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Cloning refactor and data race fix #393
Conversation
I was the original one to code clone, back when I wasn't very good at Go. I am sorry. Thank you for doing this.
I need zero convincing to change this out of concurrency, lol
I'm interested: how so? I'm not sure how we can check for equivalence, since boothLeastRotation is handling the circular sequences, where most time is spent.
This makes sense to me. I've always thought an iterative approach would be better.
This is somewhere I can help! I think I am the main user of this functionality, though I hope more folks will use it soon (and I use it to back some easier-to-use APIs for people). So, the idea is that I can use this for simulating GoldenGate assemblies that I am doing in the lab. The thing was, I made this functionality to truly model even the most complex ligations. _I've literally never found someone who did these complexity ligations. So it solves a problem that doesn't really exist. Really what is needed is for the function to only simulate ligation with ordered pairs of overhangs (so just checks that no two fragments share overhangs, then line them up, and put em together). Far simpler, and no need to even call seqhash. The most complex use cases for combinatorial assembly can then simply iterate through each "part type" and call the function. The most-most complex use case I built for (combinatorial and variant overhang fragments) should IMO just be removed. Nuke the code [not a request in this PR, but something I'd like to eventually do].
I like the EnzymeManager. Reviewing now. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One change request, which actually changes the usage of the package a little but is a feature I'd actually like to implement.
I like the changes though, and I think this is good for a merge.
clone/clone_test.go
Outdated
// Eventually, we want to get the data for this map from ftp://ftp.neb.com/pub/rebase | ||
func getBaseRestrictionEnzymes() []Enzyme { | ||
return []Enzyme{ | ||
{"BsaI", regexp.MustCompile("GGTCTC"), regexp.MustCompile("GAGACC"), 1, 4, "GGTCTC"}, | ||
{"BbsI", regexp.MustCompile("GAAGAC"), regexp.MustCompile("GTCTTC"), 2, 4, "GAAGAC"}, | ||
{"BtgZI", regexp.MustCompile("GCGATG"), regexp.MustCompile("CATCGC"), 10, 4, "GCGATG"}, | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should expose this to consumers. The number of enzymes actually used in GoldenGates is super limited. Tbh, I think the one change would be to have these as a nice default that users can use, as well as adding the following enzymes:
"PaqCI": {"PaqCI", regexp.MustCompile("CACCTGC"), regexp.MustCompile("GCAGGTG"), 4, 4, "CACCTGC"},
"BsmBI": {"BsmBI", regexp.MustCompile("CGTCTC"), regexp.MustCompile("GAGACG"), 1, 4, "CGTCTC"},
"SapI": {"SapI", regexp.MustCompile("GCTCTTC"), regexp.MustCompile("GAAGAGC"), 1, 3, "GCTCTTC"}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done!
No problem and nothing to be sorry for! I was able to jump right in and reason about.
No clue! I'd have to take a closer look and understand what's going on under the hood. I only say this because that seemed to be a hotspot for Golden Gate. If performance becomes a problem- this might be a good place to start looking.
Yeah, I think this would be worth trying out and measuring. And ah- so maybe no point in trying to optimize this any further if it might get completely changed up. |
I just wonder if there is anyway to optimize it (booth rotation algorithm is pretty good). Guessing best way is to just remove the need for it.
Yes, but this solves a race condition, so definitely worth a merge Last thing: Can you add to the changelog? Then we can go for a merge. Thanks! |
56a4a9f
to
3941e24
Compare
} | ||
} | ||
|
||
// If, on a linear sequence, the last overhang's position + EnzymeSkip + EnzymeOverhangLen is over the length of the sequence, remove that overhang. | ||
for _, overhangSet := range [][]Overhang{forwardOverhangs, reverseOverhangs} { | ||
if len(overhangSet) > 0 { | ||
if !seq.Circular && (overhangSet[len(overhangSet)-1].Position+enzyme.Skip+enzyme.OverhangLen > len(sequence)) { | ||
if !part.Circular && (overhangSet[len(overhangSet)-1].Position+enzyme.Skip+enzyme.OverheadLength > len(sequence)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could there be some explanation of the logic behind this line and what it's doing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that is if the recognition sequence is at the very end of a circular DNA, and then cutting on the other side (ie, the beginning of the sequence).
t.Errorf("Failure of GoldenGate on incorrect enzyme should follow the exact string `Enzyme EcoRFake not found in enzymeMap`. Got: %s", err.Error()) | ||
} | ||
} | ||
|
||
func ExampleGoldenGate() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@TwFlem we should have an example test for GoldenGate
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I renamed it to 'ExampleEnzymeManager_GoldenGate' for the linter- it's still there!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well- now it is back to ExampleGoldenGate :)
clone/clone.go
Outdated
@@ -333,14 +320,24 @@ Specific cloning functions begin here. | |||
|
|||
// GoldenGate simulates a GoldenGate cloning reaction. As of right now, we only | |||
// support BsaI, BbsI, BtgZI, and BsmBI. | |||
func GoldenGate(sequences []Part, enzymeStr string) ([]string, []string, error) { | |||
func (enzymeManager EnzymeManager) GoldenGate(sequences []Part, enzymeStr string) (openConstructs []string, infiniteLoops []string, err error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should GoldenGate
really be a method of EnzymeManager
rather than its own function?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After learning more about how Golden Gate works IRL and a little bit more about SynBio- probably not.
It seems more appropriate for Golden Gate to just accept something like a 'cuttingEnzyme' of type 'Enzyme' and have the caller lookup whatever enzyme they want however they want.
This is also nice because now Golden Gate doesn't have to return an error anymore.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I personally would like the change of GoldenGate to be its own function, independent of an enzymeManager.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated to no longer be a receiver
Changes in this PR
This is an attempt to start refactor cloning for issue #367. This will also probably address issue #276.
One thing that I noticed in the clone package was all of the concurrency happening underneath golden gate. This made it a bit harder to grock what was going on. This also messed with the correctness of the program- hence #276. Perhaps more importantly, the unbounded number of go routines spinning up for pure, unblocking compute causes performance problems. So many go routines get created that the program begins spending most of its time thrashing back and forth between goroutines rather than just computing.
Here's some cpu profiles.
This is the cpu profile with the same benchmark in the changes in this PR with the concurrency that's currently in master:
This is just with the currency removed:
You'll see that with concurrency removed, more time is spent actually running golden gate rather than coordinating goroutines and gc time. In fact, the bottleneck looks to be seqhash. There is also a large amount of time on the GC. GC time could probably be improved by taking an iterative approach- that way there isn't too much pressure on the groutine's stack to keep track of everything.
As far as what to make concurrent- I don't know enough about Poly of the field and how this actually gets used. If I had to throw out suggestions on where to best add some parallelism, it might be parallelizing multiple runs of GoldenGate or parallelize that for loop that calls to fan out the initial call to recurseLigate for each fragment and maybe synchronize updating whatever will keep track of existing seqhashes. Some discussion and experimentation could go a long ways here.
Also it could be super valuable to optimize seqhash 🤷♂️
This PR also introduces the concept of EnzymeManager. The idea here is to abstract away that global enzyme map state so consumers could have a straight forward way of managing their enzyme data throughout the lifetime of the program. For example, they could add, remove, reorder, whatever Poly would want to have EnzymeManager implement. The main impetus for this was the comment
Eventually, we want to get the data for this map from ftp://ftp.neb.com/pub/rebase
Something like the EnzymeManager could make that flexibility possible and allow consumers to load whatever data they want from wherever they want.
Honestly though, we can throw the EnzymeManager out- the other changes seem much more important.
Why are you making these changes?
starts resolving #367 and will resolve #276.
Are any changes breaking? (IMPORTANT)
The EnzymeManager related changes are breaking. This isn't important for addressing #367 and #276- this can be removed if need be.
Pre-merge checklist
All of these must be satisfied before this PR is considered
ready for merging. Mergeable PRs will be prioritized for review.
primers/primers_test.go
for what this might look like.CHANGELOG.md
in the[Unreleased]
section.